Active Learning: A Visual Tour

Zeel B Patel, IIT Gandhinagar, patel_zeel@iitgn.ac.in

Nipun Batra, IIT Gandhinagar, nipun.batra@iitgn.ac.in

Rise of Supervised Learning

Today, machine learning (ML) is applied in numerous fields, including, but not limited to, Natural Language Processing (NLP), Computer-aided Diagnosis, Optimization, and Bioinformatics. A significant proportion of this success is due to a subset of ML called supervised learning. There are three main reasons behind the success of supervised learning (and of machine learning generally): 1) the availability of massive data; 2) better algorithms; and 3) powerful computational infrastructure, jointly called the AI Trinity.


Labeling the Data is Expensive

Supervised learning algorithms work only with labeled data. However, annotating data is expensive because it may require: i) excessive time and manual effort; or ii) expensive sensors. Let us understand how expensive labeling is with a few examples.

Speech Recognition

Let us say we need to convert speech or audio into text for an application such as subtitle generation (a speech-to-text task). We have to annotate multiple segments of the audio that correspond to words or phrases. The following example audio of 6 seconds may require around a minute to annotate manually. Thus, annotating millions of hours of publicly available audio data is a nearly impossible task.

Similarly, researchers have made efforts to detect COVID-19 from the sound of the human cough [AI4COVID-19], where again, collecting the correct labels is a non-trivial task.

Human Activity Recognition

Let us say we want to classify human activities into different categories (shown in the figure). We need expensive sensors to monitor the alignment or motion of the human body for such tasks. In the end, we need to map the sensor data to various activities with substantial manual effort.

All the Samples are Not Equally Important

We know that increasing the training (labeled) data improves model performance. However, not all samples contribute equally to improving the model. Let us understand this with a few examples.

SVC Says: Closer is Better

We will use synthetic two-class data generated from a bivariate normal distribution for this experiment. We will train a Support Vector Classifier (SVC) model on a subset of this dataset (5 data points) and visualize the decision boundary.

Support vector points help the SVC model distinguish between classes. Some instances in the above diagram are misclassified because we have used very few training points. Consider candidate points A, B, C, and D from the unlabeled data. Points B and D are closer to the confusion area than points A and C. Thus, points B and D are more valuable for improving the model if added to the train points.
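The setup above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the exact experiment: the class means, the choice of five training indices, and the use of the distance to the decision boundary as a proxy for "closeness to the confusion area" are all assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two-class data from bivariate normal distributions (assumed means).
X0 = rng.normal(loc=[-2, 0], scale=1.0, size=(100, 2))  # class 0
X1 = rng.normal(loc=[2, 0], scale=1.0, size=(100, 2))   # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Train on a tiny labeled subset (5 points), as in the figure.
idx = np.array([0, 1, 2, 100, 101])
clf = SVC(kernel="linear").fit(X[idx], y[idx])

# Points near the boundary (small |decision_function|) are the
# most informative candidates to label next, like points B and D.
margin_dist = np.abs(clf.decision_function(X))
most_informative = np.argsort(margin_dist)[:4]
```

The support vectors of the fitted model are exactly the labeled points that shape the boundary; unlabeled points with a small `margin_dist` sit in the confusion area.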

Confusion in Digit Classification

Let us consider the MNIST dataset (a well-known public dataset with labeled images of digits $0$ to $9$) for the classification task. We train a logistic regression model on a few random samples of the MNIST dataset. Let us see what our model learns with a stratified set of $50$ data points ($5$ samples per class). We show the normalized confusion matrix over the test set of $10000$ samples.

We can see from the confusion matrix that some digits cause more confusion than others. For example, the digit '1' has almost no confusion, while the digit '9' is confused with '7' and '4'. This is because some digits are difficult to distinguish from the model's perspective. Thus, we may need more training examples for such digits to learn them correctly. Now, we will see a regression-based example.
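The confusion-matrix experiment can be sketched as follows. As an assumption for a self-contained example, we use scikit-learn's small `digits` dataset in place of the full MNIST data; the stratified 5-per-class train set mirrors the setup described above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Stand-in for MNIST: 8x8 digit images, classes 0-9.
X, y = load_digits(return_X_y=True)

# Stratified initial train set: 5 samples per digit class.
train_idx = np.hstack([np.where(y == d)[0][:5] for d in range(10)])
test_mask = np.ones(len(y), dtype=bool)
test_mask[train_idx] = False

model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
pred = model.predict(X[test_mask])

# Row-normalized confusion matrix: entry [i, j] is the fraction of
# true digit i that the model predicts as digit j.
cm = confusion_matrix(y[test_mask], pred, normalize="true")
```

The off-diagonal mass of `cm` reveals which digit pairs the model confuses with so little training data.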

GP Needs 'Good' Data Points

We will consider sine curve data with added noise. We take a few samples (8 samples) as train points, a few as candidate points, and the rest as test points. Candidate points are potential train points.

We will fit a GPR (Gaussian Process Regressor) model with the Matern kernel to our dataset. As output, the model gives a predictive mean along with a predictive variance. The predictive variance indicates how confident the model is about its predictions.

We can observe that uncertainty (predictive variance) is higher at points distant from the train points. Let us consider a set {A, B, C, D} and check whether its points are equally useful for the model.
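A minimal sketch of this regression setup, assuming a noise scale of 0.1 and a uniform grid over one sine period (the exact dataset parameters are not given above):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
# Noisy sine curve over one period (assumed parameters).
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# 8 points as the initial train set; the rest act as candidates/test.
train_idx = rng.choice(100, size=8, replace=False)
gpr = GaussianProcessRegressor(kernel=Matern(), alpha=0.1**2)
gpr.fit(X[train_idx], y[train_idx])

# Predictive mean and standard deviation over the whole input range;
# the standard deviation grows away from the training points.
mean, std = gpr.predict(X, return_std=True)
```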

One can claim that adding points {A, D} to the train set is better than adding points {B, C}, in terms of both RMSE and predictive variance. Note that adding points to the train set is the same as labeling unlabeled data and using it for training. One may either choose these 'good' points intelligently or pick some points at random and label them. Active Learning techniques can help us determine the 'good' points, which are likely to improve our model. Now, we will discuss Active Learning techniques in detail.

The Basics of Active Learning

Wikipedia defines Active Learning as follows,

  • 'Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.'

The below diagram illustrates the general flow of Active Learning.

As shown in the flow diagram, the model sends a few samples from an unlabeled pool or distribution to the oracle (a human annotator or data source) for labeling. The samples are chosen intelligently by some criteria. Thus, Active Learning is also called optimal experimental design [link].

Random Baseline

An ML model can randomly sample data points and send them to an oracle for labeling. Random sampling will also eventually capture the global distribution in the train points. However, Active Learning aims to improve the model faster by intelligently selecting the data points for labeling. Thus, random sampling is an appropriate baseline to compare against Active Learning.

Different Scenarios for Active Learning

We have mainly three different scenarios of Active Learning:

  1. Membership Query Synthesis: The model has access to an underlying distribution of data points from which it can generate samples. The generated samples are sent to the oracle for labeling.
  2. Stream-Based Selective Sampling: We have a live stream of incoming data samples, and for each incoming sample, the model can choose to query its label or discard it based on some criteria.
  3. Pool-Based Sampling: We already have a pool of unlabeled samples. Based on some criteria, the model queries for a few samples.

The pool-based sampling scenario is suitable for most real-world applications. Considering space and time constraints, we restrict our article to pool-based sampling only.

Pool-Based Sampling

If we already have an unlabeled data set pool, we can query the data points with the following methods:

  1. Uncertainty Sampling: We query the samples based on the model's uncertainty about the predictions.
  2. Query by Committee: In this approach, we create a committee of two or more models. The committee queries the samples on which its members' predictions disagree the most.

We will demonstrate each of the above strategies with examples in the subsequent sections.

Uncertainty Sampling

Uncertainty sampling uses different approaches for classification and regression tasks. We will go through them one by one with examples.

Classification of Digits in the MNIST Dataset

The MNIST dataset is a well-known dataset having thousands of different images of digits 0 to 9. We have shown some examples here.

We will now fit a Random Forest Classifier model (an ensemble of multiple Decision Tree Classifiers) on a few random samples (50 samples) and visualize the predictions. We will explain different ways to perform uncertainty sampling using these predictions.

Above are the model's predicted class probabilities for a few random test samples. We can use different uncertainty strategies as follows.

  1. Least confident: In this method, we choose the sample for which the probability of its most probable class is minimum. In the above example, the model is least confident about sample 1's most probable class, digit '4'. So, using this approach, we choose sample 1 for labeling.

  2. Margin sampling: In this method, we choose the sample for which the difference between the probabilities of the most probable class and the second most probable class is minimum. In the above example, sample 1 has the least margin; thus, we choose sample 1 for labeling using this approach.

  3. Entropy: Entropy over $N$ classes can be calculated using the following equation, where $P(x_i)$ is the predicted probability of the $i^{th}$ class. \begin{equation} H(X) = -\sum\limits_{i=1}^{N}P(x_i)\log_2 P(x_i) \end{equation} Entropy is higher when the probability is distributed over all classes. Thus, if entropy is high, the model is more confused among all classes. In the above example, sample 2 has the highest entropy in its predictions, so we can choose it for labeling.
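The three strategies above can be computed with a few NumPy operations. The probability rows below are made up for illustration (they are not the article's samples); each strategy returns the index of the sample it would query next.

```python
import numpy as np

# Each row: predicted probabilities over 4 classes for one sample.
proba = np.array([
    [0.10, 0.35, 0.30, 0.25],   # sample 0: spread out, small margin
    [0.05, 0.05, 0.05, 0.85],   # sample 1: confident prediction
    [0.25, 0.25, 0.25, 0.25],   # sample 2: maximally confused
])

sorted_p = np.sort(proba, axis=1)               # ascending per row

least_confident = 1 - sorted_p[:, -1]           # higher = more uncertain
margin = sorted_p[:, -1] - sorted_p[:, -2]      # lower = more uncertain
entropy = -np.sum(proba * np.log2(proba), axis=1)  # higher = more uncertain

# Each strategy picks the most uncertain sample to label next;
# here all three agree on the uniformly confused sample 2.
pick_lc = np.argmax(least_confident)
pick_margin = np.argmin(margin)
pick_entropy = np.argmax(entropy)
```

On less extreme probability rows, the three strategies can pick different samples, since they summarize the probability vector differently (top-1 only, top-2 gap, or the whole distribution).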

We will now see the effect of Active Learning with these strategies on the test data (containing 10000 samples). We continue using the Random Forest Classifier model for this problem. We start with 50 samples as the initial train set and add 100 actively chosen samples over 100 iterations.


The above animation shows per-digit and overall F1-scores after each iteration. We can see that each strategy, except random sampling, tends to choose more samples of digit classes with lower F1-scores. Margin sampling performs better than the other strategies in terms of F1-score. Margin sampling and the least confident method easily outperform the random baseline, while the entropy method is comparable to the random baseline in this case. The figure below shows a comparison of all strategies.

Thus far, we have seen uncertainty sampling for classification tasks. Now, we will take a regression example to explain uncertainty sampling.

Regression on Noisy Sine Curve

We consider the sine curve dataset used earlier for this task. We fit a Gaussian Process regressor model with the Matern kernel on 5 randomly selected data points. The uncertainty measure for regression tasks is the predictive standard deviation or variance. In this example, we take the predictive variance as our measure of uncertainty.

As per the uncertainty criterion, samples with the highest predictive variance should be queried for a label. Now, we show a comparison of uncertainty sampling with the random baseline over ten iterations. We also show the next sample to query at each iteration.
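The active-learning loop just described can be condensed into a short sketch: refit the GP, query the pool point with the highest predictive standard deviation, and move it to the labeled set. The data-generation parameters are assumptions, as before.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
# Noisy sine curve (assumed parameters).
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

labeled = list(rng.choice(100, size=5, replace=False))
pool = [i for i in range(100) if i not in labeled]

for _ in range(10):  # ten active-learning iterations
    gpr = GaussianProcessRegressor(kernel=Matern(), alpha=0.1**2)
    gpr.fit(X[labeled], y[labeled])
    _, std = gpr.predict(X[pool], return_std=True)
    # Query the pool point the model is least certain about.
    query = pool[int(np.argmax(std))]
    labeled.append(query)   # "labeling" = revealing y[query]
    pool.remove(query)
```

Swapping the `argmax(std)` line for a random pool index gives the random baseline used for comparison.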


The above animation compares uncertainty sampling with random sampling. One can observe that the samples chosen by uncertainty sampling are more informative to the model and ultimately reduce the model's uncertainty (variance) and RMSE more than random sampling does.

Now, we will discuss the Query by Committee method.

Query by Committee (QBC)

The Query by Committee approach involves creating a committee of two or more learners (models). Each learner can vote on samples in the pool set. Samples on which the committee members disagree the most are considered for querying. For classification tasks, we can take the mode of the learners' votes; in regression settings, we can average their predictions. The central intuition behind QBC is to minimize the version space: initially, the models hold different hypotheses, which converge as we query more samples.

We can set up a committee for QBC using the following approaches:

  1. The same type of model with different hyperparameters
  2. The same model with different segments of the dataset
  3. Different types of models with the same dataset

We will explain the first approach using SVC (Support Vector Classifier) models with the RBF kernel on the Iris dataset. We do not describe the other approaches due to space and time constraints.

Classification on Iris Dataset

We initially train the model on six samples and actively choose 30 samples from the pool set. We test the model's performance at each iteration on a fixed test set of 30 samples.


The points queried by the committee are those on which the learners disagree the most, as can be observed in the above plot. We can see that, initially, the models learn different decision boundaries for the same data. Iteratively, they converge to a similar hypothesis and thus start learning similar decision boundaries.

We now show a comparison of the overall F1-score between the random baseline and QBC. QBC outperforms the random baseline most of the time.

Few More Active Learning Strategies

There are a few more Active Learning techniques that we do not cover here due to space and time constraints, but we describe them briefly:

  1. Expected model change: Selecting the samples that would cause the greatest change in the model.
  2. Expected error reduction: Selecting the samples likely to reduce the generalization error of the model the most.
  3. Variance reduction: Selecting the samples that may help reduce the output variance.

With this, we complete our visual tour of Active Learning techniques.

References

  1. Settles, Burr. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.
  2. Danka, Tivadar, and Peter Horvath. "modAL: A modular active learning framework for Python." arXiv preprint arXiv:1805.00979 (2018).